Incorporating Global Information into Supervised Learning for Chinese Word Segmentation
Abstract
This paper presents a novel approach to Chinese word segmentation (CWS) that uses global information (GI), such as the co-occurrence of sub-sequences and the outputs of unsupervised segmentation over the whole text, to further improve the state-of-the-art performance of conditional random field (CRF) learning. In existing work on CWS, supervised and unsupervised learning have seldom been combined so as to strengthen each other. Our attempt here is to integrate unsupervised learning into supervised learning for CWS. Our experimental results show that a character-based CRF framework can make effective use of global information to improve on the best existing results.
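As a rough illustration of the idea in the abstract, the sketch below shows how a character-based tagger might combine local features with one global feature computed over the whole text. It is a minimal sketch under stated assumptions, not the authors' implementation: the B/M/E/S tag set, the function names (`to_bmes`, `ngram_counts`, `char_features`), and the choice of raw bigram frequency as the global feature are all hypothetical and do not come from the paper.

```python
from collections import Counter


def to_bmes(words):
    """Convert a segmented sentence (list of words) into per-character tags.

    Assumed tag set: B = begin, M = middle, E = end, S = single-character word.
    """
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def ngram_counts(text, max_n=3):
    """Count every character n-gram (1 <= n <= max_n) over the whole text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def char_features(sent, i, counts):
    """Build a feature dict for character i: local context plus one global count."""
    prev_c = sent[i - 1] if i > 0 else "<s>"
    next_c = sent[i + 1] if i + 1 < len(sent) else "</s>"
    bigram = sent[i:i + 2]
    return {
        "c0": sent[i],
        "c-1": prev_c,
        "c+1": next_c,
        # Hypothetical global feature: frequency, in the whole text,
        # of the character bigram that starts at position i.
        "gi_bigram_count": counts[bigram] if len(bigram) == 2 else 0,
    }
```

Feature dictionaries of this shape, one per character, paired with the tags from `to_bmes`, are the kind of input a sequence labeller such as a CRF could be trained on. In practice a count like this would probably be bucketed or normalised before use.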
Similar resources
Exploiting Unlabeled Text with Different Unsupervised Segmentation Criteria for Chinese Word Segmentation
This paper presents a novel approach to improving Chinese word segmentation (CWS) that uses unlabeled data, such as training and test data without annotation, to further improve the state-of-the-art performance of supervised learning. Lexical information serves to carry information from the unlabeled text into the supervised learning model. Four types of unsupervis...
Semi-supervised Chinese Word Segmentation for CLP2012
Chinese word segmentation (CWS) lays the essential foundation for Mandarin Chinese analysis. However, its performance is always limited by the identification of unknown words, especially in short texts such as microblog posts. While local context is of little help in handling unknown words, global context does provide enough contextual information and could be used to guide the CWS process. Based on this moti...
Improving Chinese Word Segmentation with Description Length Gain
Supervised and unsupervised learning have seldom been joined so as to lend strength to each other in the field of Chinese word segmentation (CWS). This paper presents a novel approach to CWS that utilizes description length gain (DLG), an empirical goodness measure for unsupervised word discovery, to enhance the segmentation performance of conditional random field (CRF) learning. Specifically, w...
Semi-Supervised Learning for Natural Language
Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available “for free” in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g. word sense disambiguation, informat...
Long Short-Term Memory Neural Networks for Chinese Word Segmentation
Currently, most state-of-the-art methods for Chinese word segmentation are based on supervised learning, with features mostly extracted from a local context. These methods cannot utilize long-distance information, which is also crucial for word segmentation. In this paper, we propose a novel neural network model for Chinese word segmentation, which adopts the long short-term memory (LST...